pacman::p_load(sf, sfdep, tmap, tidyverse, RColorBrewer, ggplot2, spatstat)Take Home Exercise 3
1. Overview
1.1 Introduction
1.2 My Responsibilities
- Data Preparation, Preprocessing
1.3 Importing Packages
Here, we have loaded the following packages:
2. The Data
For this project, we will be using the following data sets.
Singapore Resale Flat Prices (Jan-17 to Sep-24) from Kaggle, an accumulation of information relating to the sale of Singapore’s public housing apartments colloquially referred to as flats
This dataset augments the original dataset by including 4 important categories of information:
X/Y lat/lng coordinates, which can be used for geospatial plotting.
Information about the closest MRT station to the flat
Information about the closest primary school to flat
the URA planning area (or town) of the flat.
MP
HDB Property Info from data.gov.sg
2.1 Importing Geospatial Data
The code chunk below is used to import MP_SUBZONE_WEB_PL shapefile by using st_read() of sfpackages.
mpsz <- st_read(dsn = "data/geospatial/MasterPlan2014SubzoneBoundaryWebSHP/", layer = "MP14_SUBZONE_WEB_PL")Reading layer `MP14_SUBZONE_WEB_PL' from data source
`/Users/georgiaxng/georgiaxng/is415-handson/Take-home_Ex/Take-home_Ex03/data/geospatial/MasterPlan2014SubzoneBoundaryWebSHP'
using driver `ESRI Shapefile'
Simple feature collection with 323 features and 15 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 2667.538 ymin: 15748.72 xmax: 56396.44 ymax: 50256.33
Projected CRS: SVY21
Here is a plot of Singapore.
plot(mpsz)2.2 Importing Aspatial Data
2.2.1 Importing Resale Flat Prices
The code chunk below is used to import the Resale Flat Prices dataset from Kaggle.
resale_df = read_csv('data/aspatial/resale_hdb_price_for_kaggle_2024-30sep.csv')To get a brief overview of existing columns of this dataset, we can use colnames() to do so.
colnames(resale_df) [1] "...1" "month"
[3] "storey_range" "floor_area_sqm"
[5] "flat_model" "lease_commence_date"
[7] "remaining_lease" "resale_price"
[9] "floor_area_sqft" "price_per_sqft"
[11] "blk_no" "road_name"
[13] "building" "postal"
[15] "address" "lease_commence_date_r"
[17] "planning_area_ura" "region_ura"
[19] "x" "y"
[21] "latitude" "longitude"
[23] "closest_mrt_station" "distance_to_mrt_meters"
[25] "transport_type" "line_color"
[27] "distance_to_cbd" "closest_pri_school"
[29] "distance_to_pri_school_meters"
2.2.1.1 CRS Adjustments
Another important step after importing the dataset is checking the coordinate system used, as seen in the result below using st_crs(), we can see that there is no CRS.
st_crs(resale_df)Coordinate Reference System: NA
Therefore, we need to convert the longitude and latitude columns into a spatial format. Since our dataset is based in Singapore and it uses the SVY21 coordinate reference system (CRS Code: 3414), we will use the st_transform() function to perform the conversion and create the geometry column.
resale_sf <- resale_df %>%
st_as_sf(coords = c("longitude", "latitude"), crs=4326) %>%
st_transform(crs = 3414)Using st_crs(), we can check and verify that the conversion is successful.
st_crs(resale_sf)Coordinate Reference System:
User input: EPSG:3414
wkt:
PROJCRS["SVY21 / Singapore TM",
BASEGEOGCRS["SVY21",
DATUM["SVY21",
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4757]],
CONVERSION["Singapore Transverse Mercator",
METHOD["Transverse Mercator",
ID["EPSG",9807]],
PARAMETER["Latitude of natural origin",1.36666666666667,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8801]],
PARAMETER["Longitude of natural origin",103.833333333333,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8802]],
PARAMETER["Scale factor at natural origin",1,
SCALEUNIT["unity",1],
ID["EPSG",8805]],
PARAMETER["False easting",28001.642,
LENGTHUNIT["metre",1],
ID["EPSG",8806]],
PARAMETER["False northing",38744.572,
LENGTHUNIT["metre",1],
ID["EPSG",8807]]],
CS[Cartesian,2],
AXIS["northing (N)",north,
ORDER[1],
LENGTHUNIT["metre",1]],
AXIS["easting (E)",east,
ORDER[2],
LENGTHUNIT["metre",1]],
USAGE[
SCOPE["Cadastre, engineering survey, topographic mapping."],
AREA["Singapore - onshore and offshore."],
BBOX[1.13,103.59,1.47,104.07]],
ID["EPSG",3414]]
head(resale_df)# A tibble: 6 × 29
...1 month storey_range floor_area_sqm flat_model lease_commence_date
<dbl> <date> <chr> <dbl> <chr> <date>
1 0 2017-01-01 10 TO 12 44 Improved 1979-01-01
2 1 2017-01-01 01 TO 03 67 New Generati… 1978-01-01
3 2 2017-01-01 01 TO 03 67 New Generati… 1980-01-01
4 3 2017-01-01 04 TO 06 68 New Generati… 1980-01-01
5 4 2017-01-01 01 TO 03 67 New Generati… 1980-01-01
6 5 2017-01-01 01 TO 03 68 New Generati… 1981-01-01
# ℹ 23 more variables: remaining_lease <chr>, resale_price <dbl>,
# floor_area_sqft <dbl>, price_per_sqft <dbl>, blk_no <chr>, road_name <chr>,
# building <chr>, postal <chr>, address <chr>, lease_commence_date_r <date>,
# planning_area_ura <chr>, region_ura <chr>, x <dbl>, y <dbl>,
# latitude <dbl>, longitude <dbl>, closest_mrt_station <chr>,
# distance_to_mrt_meters <dbl>, transport_type <chr>, line_color <chr>,
# distance_to_cbd <dbl>, closest_pri_school <chr>, …
2.2.2 Importing HDB Property Information
hdb_info_df <- read_csv('data/aspatial/HDB Property Information.csv')2.3 Data Wrangling
2.3.1 Adding Housing Type for Flats
To assess whether the housing type influences the resale price of flats, we need to include this variable in our analysis. However, the current dataset lacks this information and thus we will be deriving it using other relevant variables. Specifically, we can use the floor area of the flats to determine the housing type, based on the guidelines provided on data.gov.sg.
2 Rm (2 room HDB Flat). 1 bedroom with a built-in area of about 45 sq m or 485 sq ft.
3 Rm (3 room HDB Flat). 2 bedrooms with a built-in area of about 70 sq m or 754 sq ft.
4 Rm (4 room HDB Flat). 3 bedrooms with a built-in area of about 90 sq m or 969 sq ft.
5 Rm (5 room HDB Flat). 3 bedrooms with a built-in area of about 110 sq m or 1,184 sq ft.
EA (Executive Apartment). 3/4 bedrooms with built-in area of about 150 sqm or 1,615 sqft.
EM (Executive Mansionette). Same as Executive apartment, except it has two levels.
6 Rm (6 room HDB Flat). Jumbo flat joint by two 3 room flats
Particularly, we will just group EA, EM and 6 rm flats into one category since there is no way for us right now to differentiate just by the size of the flats.
This chunk of code below derives the housing type for each flat by using conditional statements to check if the stated floor area is closer to that of a particular housing type and add the value in the new column created using mutate().
resale_sf <- resale_sf %>%
mutate(housing_type = case_when(
floor_area_sqft <= 619 ~ "2 Rm", # Closer to 485 sqft
floor_area_sqft > 619 & floor_area_sqft <= 862 ~ "3 Rm", # Closer to 754 sqft
floor_area_sqft > 862 & floor_area_sqft <= 1076 ~ "4 Rm", # Closer to 969 sqft
floor_area_sqft > 1076 & floor_area_sqft <= 1399 ~ "5 Rm", # Closer to 1184 sqft
floor_area_sqft > 1399 ~ "EA/EM/6 Rm", # Closer to 1615 sqft
))2.3.2 Grouping Remaining Lease By Range
First, let us convert the lease period from chr to total months. Below, we extract the different year and month from the string and then make the calculations to convert it to total months.
# Function to convert lease period to total months
convert_to_months <- function(lease) {
parts <- strsplit(lease, " ")[[1]] # Split by space
years <- as.numeric(parts[1]) # Extract years
# Check if months are present
if (length(parts) > 2) {
months <- as.numeric(parts[3]) # Extract months if present
} else {
months <- 0 # Set months to 0 if not present
}
total_months <- years * 12 + months # Convert to total months
return(total_months)
}
resale_sf <- resale_sf %>%
mutate(remaining_lease_total_months = sapply(remaining_lease, convert_to_months))Using the number of total months, we can then get a brief overview of the remaining lease of the resale flat including the mins and max. In this case, we can see that the minimum number of months is 497 (ard 41 years) and maximum is 1173 (around 97 years). This information will later be useful in setting the ranges for the remaining lease.
summary(resale_sf['remaining_lease_total_months']) remaining_lease_total_months geometry
Min. : 497.0 POINT :190724
1st Qu.: 756.0 epsg:3414 : 0
Median : 893.0 +proj=tmer...: 0
Mean : 894.4
3rd Qu.:1062.0
Max. :1173.0
Since the minimum is 41 years, our date range can start from 40-49 years onwards.
E.g. 40-49 years refers to 40 years 0 months to 49 years 11 months.
# Define the ranges and labels
breaks <- c(480, 600, 720, 840, 960, 1080, 1200)
labels <- c( "40-49 years", "50-59 years", "60-69 years", "70-79 years", "80-89 years", "90-99 years")
resale_sf <- resale_sf %>%
mutate(remaining_lease_range = cut(remaining_lease_total_months, breaks = breaks, labels = labels))2.3.3 Removal of Redundant Columns
# Define columns to be removed
columns_to_remove <- c("floor_area_sqm", "flat_model", "lease_commence_date", "remaining_lease", "blk_no", "road_name", "building", "address", "lease_commence_date_r", "postal","...1")
# Remove columns only if they exist in the dataframe
resale_sf <- resale_sf %>%
dplyr::select(-all_of(columns_to_remove[columns_to_remove %in% names(resale_sf)]))2.3.4 Adjustments to Storey Ranges
Through the histogram of the storey ranges, we can observe a right skewed distribution which shows that higher storeys are of a much lower frequency among resale flats. This is likely due to the fact that most HDB flats in Singapore are lower-rise buildings.
# Create a summary of counts for each remaining lease range
count_data <- resale_sf %>%
group_by(storey_range) %>%
summarise(count = n())
# Create the bar plot with labels
ggplot(count_data, aes(x = storey_range, y = count)) +
geom_bar(stat = "identity", fill = "steelblue") +
geom_text(aes(label = count), vjust = -0.5, size = 4) + # Add labels on top of the bars
labs(title = "Count of Storey Range",
x = "Storey Range",
y = "Count") +
theme_minimal()Here is a summary of the maximum floor levels of all the HDB flats in Singapore. As shown here we can observe that both the mean and median maximum floor is a mere 12.
summary(hdb_info_df$max_floor_lvl) Min. 1st Qu. Median Mean 3rd Qu. Max.
1.0 6.0 12.0 12.1 16.0 50.0
Thus, to simplify our analysis and make the visualization easier to interpret, we can combine the higher storey ranges into a single broader range. In particular, we will group the floors 16 or higher together to form the 16+ floors range.
resale_sf <- resale_sf %>%
mutate(storey_range_grouped = case_when(
storey_range %in% c("01 TO 03", "04 TO 06", "07 TO 09", "10 TO 12", "13 TO 15") ~ storey_range,
storey_range %in% c("16 TO 18", "19 TO 21", "22 TO 24", "25 TO 27", "28 TO 30", "31 TO 33", "34 TO 36", "37 TO 39", "40 TO 42", "43 TO 45", "46 TO 48", "49 TO 51") ~ "16+",
TRUE ~ storey_range
))
# Plot the grouped data
ggplot(resale_sf, aes(x = storey_range_grouped)) +
geom_bar() +
geom_text(stat='count', aes(label=..count..), vjust=-0.5) +
labs(title = "Count of Storey Range (Grouped)",
x = "Storey Range",
y = "Count") +
theme_minimal()2.3.5 Checking for NA values
2.3.6 Time Period Selection for Analysis
resale_5yr_sf <- resale_sf %>% filter (year(month) >= 2020)
write_rds(resale_5yr_sf,'data/rds/resale_5yr_sf.rds')2.4 Overview Of Dataset
colnames(resale_5yr_sf) [1] "month" "storey_range"
[3] "resale_price" "floor_area_sqft"
[5] "price_per_sqft" "planning_area_ura"
[7] "region_ura" "x"
[9] "y" "closest_mrt_station"
[11] "distance_to_mrt_meters" "transport_type"
[13] "line_color" "distance_to_cbd"
[15] "closest_pri_school" "distance_to_pri_school_meters"
[17] "geometry" "housing_type"
[19] "remaining_lease_total_months" "remaining_lease_range"
[21] "storey_range_grouped"
Explanatory Variables:
Continuous
Remaining Lease:
remaining_lease_total_monthsSize of flat:
floor_area_sqftDistance to transport:
distance_to_mrt_metersDistance to amenities:
distance_to_pri_school_metersDistance to central business district:
distance_to_cbd
Categorical
Remaining Lease:
remaining_lease_periodStorey Height:
storey_rangeHousing Type:
housing_type
Dependent Variables:
- Resale Price:
resale_price,price_per_sqft
3. Shiny Storyboard
4. Distribution
4.1 Categorical Variables
4.1.1 Remaining Lease Range
# Create a summary of counts for each remaining lease range
count_data <- resale_5yr_sf %>%
group_by(remaining_lease_range) %>%
summarise(count = n())
# Create the bar plot with labels
ggplot(count_data, aes(x = remaining_lease_range, y = count)) +
geom_bar(stat = "identity", fill = "steelblue") +
geom_text(aes(label = count), vjust = -0.5, size = 4) + # Add labels on top of the bars
labs(title = "Count of Remaining Lease Ranges",
x = "Remaining Lease Range",
y = "Count") +
theme_minimal()4.1.2 Storey Range
# Create a summary of counts for each remaining lease range
count_data <- resale_5yr_sf %>%
group_by(storey_range_grouped) %>%
summarise(count = n())
# Create the bar plot with labels
ggplot(count_data, aes(x = storey_range_grouped, y = count)) +
geom_bar(stat = "identity", fill = "steelblue") +
geom_text(aes(label = count), vjust = -0.5, size = 4) + # Add labels on top of the bars
labs(title = "Count of Storey Range",
x = "Storey Range",
y = "Count") +
theme_minimal()4.1.3 Housing Type
# Create a summary of counts for each remaining lease range
count_data <- resale_5yr_sf %>%
group_by(housing_type) %>%
summarise(count = n())
# Create the bar plot with labels
ggplot(count_data, aes(x = housing_type, y = count)) +
geom_bar(stat = "identity", fill = "steelblue") +
geom_text(aes(label = count), vjust = -0.5, size = 4) + # Add labels on top of the bars
labs(title = "Count of Housing Type",
x = "Housing Type",
y = "Count") +
theme_minimal()Bivariate Analysis
Correlation Matrix
Drafts
lease_storey_table <- table(resale_5yr_sf$storey_range_grouped, resale_5yr_sf$remaining_lease_range)
lease_storey_df <- as.data.frame(lease_storey_table)
# Heatmap with ggplot
ggplot(lease_storey_df, aes(Var1, Var2, fill = Freq)) +
geom_tile() +
scale_fill_gradient(low = "white", high = "red") +
labs(title = "Heatmap: Storey Range vs. Remaining Lease Range",
x = "Storey Range",
y = "Remaining Lease Range") +
theme_minimal()mpsz_main <- st_union(mpsz) %>%
st_cast("POLYGON")
mpsz_main <- mpsz_main[c(12)]
mpsz_owin <- as.owin(mpsz_main)plot(mpsz_owin)resale_2020_sf <- resale_5yr_sf %>% filter (year(month) == 2020)
resale_2021_sf <- resale_5yr_sf %>% filter (year(month) == 2021)
resale_2022_sf <- resale_5yr_sf %>% filter (year(month) == 2022)
resale_2023_sf <- resale_5yr_sf %>% filter (year(month) == 2023)
resale_2024_sf <- resale_5yr_sf %>% filter (year(month) == 2024)tmap_mode('plot')
tm_shape(mpsz%>% filter(PLN_AREA_N == 'ANG MO KIO'))+
tm_polygons()+
tm_shape(resale_2023_sf %>% filter(planning_area_ura == 'ANG MO KIO'))+
tm_dots()tmap_mode('plot')
tm_shape(mpsz%>% filter(PLN_AREA_N == 'ANG MO KIO'))+
tm_polygons()+
tm_shape(resale_2024_sf %>% filter(planning_area_ura == 'ANG MO KIO'))+
tm_dots()test <- resale_5yr_sf %>% filter(housing_type == 'EA/EM/6 Rm')
testSimple feature collection with 10649 features and 20 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 11519.15 ymin: 28097.64 xmax: 42623.63 ymax: 48372.85
Projected CRS: SVY21 / Singapore TM
# A tibble: 10,649 × 21
month storey_range resale_price floor_area_sqft price_per_sqft
* <date> <chr> <dbl> <dbl> <dbl>
1 2020-01-01 01 TO 03 700000 1432. 489.
2 2020-01-01 07 TO 09 770000 1755. 439.
3 2020-01-01 07 TO 09 623000 1615. 386.
4 2020-01-01 04 TO 06 615000 1421. 433.
5 2020-01-01 07 TO 09 590000 1464. 403.
6 2020-01-01 13 TO 15 765000 1432. 534.
7 2020-01-01 04 TO 06 525000 1432. 367.
8 2020-01-01 13 TO 15 758000 1539. 492.
9 2020-01-01 07 TO 09 680000 1528. 445.
10 2020-01-01 13 TO 15 798888 1539. 519.
# ℹ 10,639 more rows
# ℹ 16 more variables: planning_area_ura <chr>, region_ura <chr>, x <dbl>,
# y <dbl>, closest_mrt_station <chr>, distance_to_mrt_meters <dbl>,
# transport_type <chr>, line_color <chr>, distance_to_cbd <dbl>,
# closest_pri_school <chr>, distance_to_pri_school_meters <dbl>,
# geometry <POINT [m]>, housing_type <chr>,
# remaining_lease_total_months <dbl>, remaining_lease_range <fct>, …